Abstract. We present a framework for selecting and developing measures
of dependence when the goal is the quantification of a relationship
between two variables, not simply the establishment of its existence.
Much of the literature on dependence measures is focused, at least
implicitly, on detection or revolves around the inclusion/exclusion of
particular axioms and discussing which measures satisfy said axioms.
In contrast, we start with only a few nonrestrictive guidelines focused
on existence, range and interpretability, which provide a very open and
flexible framework. For quantification, the most crucial is the notion of
interpretability, whose foundation can be found in the work of Goodman
and Kruskal [Measures of Association for Cross Classifications
(1979) Springer], and whose importance can be seen in the popularity
of tools such as the $R^2$ in linear regression. While Goodman and
Kruskal focused on probabilistic interpretations for their measures, we
demonstrate how more general measures of information can be used to
achieve the same goal. To that end, we present a strategy for building
dependence measures that is designed to allow practitioners to tailor
measures to their needs. We demonstrate how many well-known measures
fit in with our framework and conclude the paper by presenting
two real data examples. Our first example explores U.S. income and
education where we demonstrate how this methodology can help guide
the selection and development of a dependence measure. Our second
example examines measures of dependence for functional data, and illustrates
them using data on geomagnetic storms.

Key words and phrases: Measures of dependence, quantification, in- 
formation metrics, functional data, interpretability, uses of dependence. 



1. INTRODUCTION 

Exploring the relationships between variables is 
one of the most fundamental tasks in statistics and 
at the heart of many statistical analyses. A common
goal is to clearly demonstrate the existence of
dependence between two variables. Once the exis- 
tence of a relationship is accepted or established, 
dependence measures can be used to summarize that 
relationship in an informative and concise fashion. 
They can provide deep insight into the relationships
between variables while being more easily communicated
than full model descriptions. For example,
in financial portfolios, the dependence between var- 
ious assets plays a crucial role in moderating risk. 
In epidemiology, it is important to quantify the de- 
pendence between diseases and various factors to 
determine which are better predictors of risk. In 
genetics, the goal is often to measure the strength 
of the dependence between various phenotypes and 
genotypes to gauge which are the most important 
biological pathways in the risk architecture of com- 
plex traits. The dependence between genetic mark- 
ers plays a significant role in the design of associa- 
tion studies. In any field where statistical procedures 
are applied, the ability to quantify dependence in an 
interpretable fashion can be crucial. Unfortunately, 
the proliferation of hypothesis testing has steered 
the development of dependence measures away from 
interpretability. Many modern measures are devel- 
oped with the goal of catching any trace of depen- 
dence in any form, with less focus on the inter- 
pretability of their measures beyond the extreme 
values of 0 and 1. Examples of such measures that
motivated our current work include the distance cor- 
relation (Szekely, Rizzo and Bakirov (2007); Szekely 
and Rizzo (2009)), the maximal information coeffi- 
cient, MIC (Reshef et al. (2011)) and copula based 
measures (Schweizer and Wolff (1981); Siburg and 
Stoimenov (2010)). Such measures are exciting new 
tools for the detection of nonlinear relationships, but 
are difficult to interpret at intermediate values. The 
inability to interpret a measure of dependence is not 
necessarily detrimental to an analysis, but it limits 
its use as a summary tool and effectively isolates 
its utility to the realm of hypothesis testing or the 
detection of dependence. 

The goal of the present work is to develop a frame- 
work for dependence measures when the primary 
task is the quantification and summarization of a 
relationship between two variables, not just the es- 
tablishment of its existence. We designed this frame- 
work with the aim of (a) helping practitioners decide 
on an appropriate dependence measure and consider 
more nonstandard measures if applicable, (b) guid- 
ing the development of new measures of dependence 
with interpretability as a priority, and (c) starting a 
discussion challenging current views on dependence. 
The central idea of the methodology is to build a 
dependence measure by first constructing an appro- 
priate measure of information, determined by the 
practitioner and the setting, and then using that 
measure of information to quantify dependence. Alternatively,
for a preexisting measure, an interpretation
can be developed if one can find an information
function embedded within it. More succinctly,
we adopt the view that measuring dependence in 
an interpretable way is, in fact, about measuring 
the amount of relevant information one variable con- 
tains about another. 

To elucidate this dichotomy, and thus the need for 
our framework, consider the high frequency data ex- 
ample on geomagnetic storms. We will present this 
example with greater depth later on, but under- 
standing these storms has become very important 
(see, e.g., Moskowitz (2011)), as they can have dam- 
aging effects on GPS, satellite, radar and data stor- 
age technologies. In that example we measure the 
dependence between storms at different locations 
on the earth with the goal of determining predic- 
tive capability and explained variability. Since the 
storms are driven by solar wind, they are obviously 
dependent, thus making a generic measure with no 
interpretation of little use. 

The notion of using information to measure de- 
pendence has been studied extensively in the in- 
formation theory literature and we reference (Ash 
(1990); Cover and Thomas (2006); Ebrahimi, Soofi 
and Soyer (2010) and Grey (2011)), to name only 
a few. Our perspective differs from the information 
theory literature in two distinct ways. First, we do 
not attempt to determine universally applicable or 
ideal information functions and, in particular, we do 
not focus extensively on entropy, though it will fall 
naturally within the proposed framework. Second, 
we distinguish sharply between the detection of de- 
pendence and its quantification. Only in the case 
of quantification do we insist on the importance of 
an information function. More classical methods on 
dependence measures go all the way back to Renyi 
(1959) where he outlines a set of mathematical axioms
that dependence measures should satisfy. Renyi's
axioms have been modified in various ways
(see, e.g., Bell (1962); Hall (1970); Schweizer and
Wolff (1981) and Nelsen (2010)), but there is little
discussion of one crucial property: dependence mea- 
sures intended for quantification should have clear 
interpretations associated with them. Most of the 
common axioms placed on dependence measures are 
less relevant in such a context. Indeed, the only 
body of literature we could find developing similar 
ideas was the seminal work of Goodman and Kruskal 
(1979), where they meticulously develop measures 
with probabilistic interpretations. Our goal is similar
to theirs, but we achieve it in markedly different
ways.

The paper is organized as follows. In Section 2 
we outline our framework for developing measures 
of dependence based on information functions. In 
Section 3 we explore many examples of dependence 
measures that fall nicely into the proposed frame- 
work. In Section 4 we illustrate the importance of 
this framework in two different real world settings. 
The first explores the relationship between income 
and education and, in particular, explores how the 
information relevant to the problem should guide 
the choice of measure. The second application in- 
volves the analysis of geomagnetic storm data and 
the relatively new area of functional data analysis. 
We show how the ideas presented here can guide
the development of new interpretable measures
of dependence in that area. We conclude the paper 
with a discussion in Section 5. 

2. FRAMEWORK 

An essential starting point in considering any de- 
pendence measure is first examining its intended 
use. Though there may be many creative applica- 
tions for dependence measures, three of the most 
significant ones are the following: 

(1) detection: detecting dependence in any form; 

(2) ranking: ordering the dependence in different 
relationships (e.g., model selection); 

(3) quantification: summarizing a relationship in 
an informative fashion. 

A similar, though dichotomous, breakdown was noted 
by Lehmann (1966). 

In the first setting, that a dependence even exists 
is sometimes questionable. Thus, a measure leading 
to a valid and powerful testing procedure would be 
most desirable. As a simple illustration, suppose a 
researcher was examining the relationship between 
income and height. In such a situation, there is no 
clear reason, a priori, that the two variables should 
be dependent. Thus, first using a measure designed 
for dependence detection might be appropriate. Mea- 
sures such as correlation can be used to detect linear 
dependence, while the distance covariance or MIC 
can be used for nonlinear relationships. 

In the second setting, the main goal is to deter- 
mine which relationships are the strongest. For ex- 
ample, we may wish to rank or select several vari- 
ables that best explain income. Thus, we would aim 
to choose a subset with the highest dependence. In
such a setting statistical power and interpretability
are not necessarily the primary concern. The MIC,
for example, attempts to establish a useful "equitability"
property that assigns similar values to relationships
with similar noise levels, regardless of the
functional nature of the relationship.

In the third setting, the existence of a dependence 
is either obvious or already well established. There, 
it would be more important to quantify that depen- 
dence in a meaningful way. Using a similar simple 
illustration, there is a clear, well-established depen- 
dence between income and education. Thus, using a 
measure designed solely for detection would be un- 
productive, while utilizing a carefully chosen mea- 
sure with a clear interpretation could provide a great 
deal of insight that could be easily communicated to 
others. 

We present our framework as a means of evalu- 
ating and developing measures of dependence when 
the goal is the quantification of a dependence. We 
start by laying out guidelines or general properties 
that dependence measures should satisfy in such a 
setting. We then outline our method, based on in- 
corporating information functions, to demonstrate 
how to satisfy those guidelines. 

Guidelines for Quantification 

In contrast to Renyi, we propose only three guide- 
lines instead of six axioms: 

(1) existence: the measure should exist for a large 
collection of random variables, vectors and/or func- 
tions, including those relevant to the analysis; 

(2) range: the range of the measure should be 

[0, 1]; 

(3) interpretability: the measure should have a clear
interpretation, for all possible values, based on information
content. Furthermore, 0 should represent
"no information," while 1 represents "complete information."

The difficulty in insisting on mathematical axioms is 
that "interpretability" is impossible to define math- 
ematically, and yet is the most crucial property. Our 
guidelines are designed to induce a rather malleable 
framework that can easily adapt to the needs of the 
researcher and the setting. 

The first guideline simply indicates that the mea- 
sure should be applicable to any variables that one 
may come across in their analysis. There is no reason 
why all measures should exist for all random variables,
or even all variables with some specific structure.
The main concern should be that the measure
is at least well defined for the possible variables that
may arise in the analysis. The second guideline simply
creates a standard range of values for all measures.
Since every measure should have a concept of
"most dependent" and "least dependent," it makes 
sense that the range be a bounded interval, and us- 
ing [0, 1] is fairly standard. Obviously some mea- 
sures such as the correlation can take negative val- 
ues. However, only in the univariate case where the 
relationships between variables are monotone does 
having a signed measure make sense. In that case, 
nearly all signed measures will have the same sign. 
So, for example, one could use correlation or signed 
rank correlation to better understand the directional 
relationship of the variables, while still using an- 
other [0, 1] measure to quantify the magnitude of 
the dependence in a relevant and interpretable way. 

The essential guideline for quantification is "inter- 
pretability." Nearly every dependence measure has 
a fairly clear interpretation at its extreme values. In 
fact, a large emphasis is usually placed on the inter- 
play between complete independence/dependence 
and the extreme values of the measure. However, we 
assert that not only should 0 and 1 have a clear interpretation,
so should every value in between. Furthermore,
we claim that measuring dependence in
an interpretable way is really about measuring the 
amount of relevant information one variable con- 
tains about another. 

Noticeably absent are axioms/properties such as: 

• zero dependence implies statistical independence, 

• symmetry, 

• invariance,

• equivalence to absolute correlation in the joint 
normal setting. 

The first property is important when detecting po- 
tentially nonlinear relationships, but is not neces- 
sary for quantification, especially if the interpreta- 
tion of the measure is highly relevant. By symmetry, 
we mean that the dependence between X and Y is 
unchanged if the two are swapped. Models are of- 
ten not symmetric, and there is no reason to insist 
that all dependence measures should be. However, 
we will discuss a potential method for symmetrizing 
in the next section. Invariance means that one-to- 
one transformations of X and/or Y do not change 
the value of the measure. However, the scale of Y 
plays a crucial role in measures such as the correla- 
tion and correlation ratio. Again, there is no obvious 
reason why all measures should have such a prop- 
erty. 



From Information to Dependence 

Incorporating the interpretability property is a 
challenging task. The solution we adopt for building 
interpretability into a measure is based on treating 
the quantification problem as a user-specified infor- 
mation content exercise. In particular, we introduce 
what we call an information link function that mea- 
sures the amount of important information, as de- 
termined by the practitioner and the setting, one 
variable contains about another. Then a practitioner 
could either build their own information link func- 
tion or select a predefined information link function 
that emphasizes the priorities of their analysis. It is 
important to note that even if a measure is based on 
an information function, one still needs to carefully 
examine the type of information to determine if it 
is relevant. This is not a trivial task in applications, 
but should be a main consideration in evaluating a 
dependence measure. We will use $I(X; Y)$ to denote
the value of our information function evaluated at
X and Y, read as "the amount of information X
contains about Y."

Definition 1. Let $\mathcal{F}_2 \subset \mathcal{F}_1$ be two collections
of random variables, vectors and/or functions. We
say that a function $I$ is an information link function
over $\mathcal{F}_1 \times \mathcal{F}_2$ if:

(1) $I: \mathcal{F}_1 \times \mathcal{F}_2 \to \mathbb{R}^+$;

(2) $I(X; Y) \le I(Y; Y)$, for any $X \in \mathcal{F}_1$ and $Y \in \mathcal{F}_2$,
with $I(X; Y) = 0$ if they are independent;

(3) if, for any X and Z in $\mathcal{F}_1$, there exists a function,
$f$, such that $Z = f(X)$, then $I(Z; Y) \le I(X; Y)$
for every $Y \in \mathcal{F}_2$.

The first property simply indicates that informa- 
tion is a nonnegative quantity. The second property 
indicates that a variable must contain the maxi- 
mum amount of information about itself and that 
independent variables contain no information about 
each other. The third property is a type of mono- 
tonicity and indicates that if one variable completely 
determines another, then it must also contain more 
information. A consequence of the third property 
is that information link functions are invariant un- 
der one-to-one transformations of the first argument 
(assuming the transformation is in $\mathcal{F}_1$), but not the
second. Such a property is reasonable, as the scale
of Y can be crucial in determining the scale of the
measured information (such as explained variance);
however, one should not be able to obtain "more information"
about Y by simply transforming the X
variable.

In contrast to the suggested guidelines, the information
definition has no interpretation requirement
listed, as articulating such a property mathematically
is all but impossible. Instead, it is the responsibility
of the researcher to determine if a given
information function has an interpretation relevant 
to their analysis. Furthermore, those that introduce 
new measures of dependence or information func- 
tions should take care to explore and develop their 
possible interpretations. 

Once a suitable information function is constructed, 
a dependence measure is easily obtained via scaling. 
Define

$$D(X; Y) = \frac{I(X; Y)}{I(Y; Y)},$$

then $D$ satisfies the following:

(1) $D: \mathcal{F}_1 \times \mathcal{F}_2 \to [0, 1]$;

(2) $D(Y; Y) = 1$; if X and Y are independent, then
$D(X; Y) = 0$;

(3) if, for any X and Z in $\mathcal{F}_1$, there exists a
measurable function, $f$, such that $Z = f(X)$, then
$D(X; Y) \ge D(Z; Y)$ for every $Y \in \mathcal{F}_2$;

(4) $D(X; Y)$ is invariant under one-to-one transformations
of X that stay in $\mathcal{F}_1$;

(5) built-in interpretability as a reduction or fraction
of information.

Therefore, D will satisfy all of our desired guide- 
lines for dependence measures and the task is re- 
duced to determining an appropriate information 
link function. Ideally, such a function will be de- 
termined on a case-by-case basis as the practitioner 
and setting dictate what information is of greatest 
importance. 
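
As a minimal sketch of the scaling step, the following Python snippet turns an arbitrary information link function into a dependence measure by dividing by the information Y contains about itself. The helper names (`explained_variance_info`, `dependence`) and the toy data are hypothetical and do not come from the framework itself; the information link used here is the empirical reduction in the variance of Y obtained by conditioning on a discrete X, chosen only for illustration.

```python
from collections import defaultdict
from statistics import fmean


def explained_variance_info(x, y):
    """Hypothetical information link I(X; Y): the empirical variance of y
    removed by predicting y with the mean of its x-group."""
    overall_mean = fmean(y)
    total = fmean((v - overall_mean) ** 2 for v in y)
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    within = 0.0
    for members in groups.values():
        group_mean = fmean(members)
        within += sum((v - group_mean) ** 2 for v in members)
    return total - within / len(y)


def dependence(info, x, y):
    """Scale an information link into D(X; Y) = I(X; Y) / I(Y; Y)."""
    total = info(y, y)
    if total <= 0:
        raise ValueError("I(Y; Y) must be positive for the ratio to be defined")
    return info(x, y) / total


x = [1, 1, 2, 2, 3, 3]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
z = [xi % 2 for xi in x]  # z = f(x) is a coarsening of x

d_yy = dependence(explained_variance_info, y, y)  # property (2): equals 1
d_xy = dependence(explained_variance_info, x, y)
d_zy = dependence(explained_variance_info, z, y)  # property (3): at most d_xy
```

Any other information link with the same call signature could be passed to `dependence` in place of `explained_variance_info`, which is the sense in which the choice of information is left to the practitioner.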

Note that if symmetry of the measure is desired,
there are at least two potential methods of accomplishing
it. However, for symmetry to be coherent in
our setting one would need to insist that $\mathcal{F}_1 = \mathcal{F}_2$ so
that swapping the variables makes sense. At that
point, one could symmetrize by either averaging the
resulting $D(X; Y)$ and $D(Y; X)$ or, more interestingly,
by using an arithmetic mean

$$\frac{I(X; Y) + I(Y; X)}{I(X; X) + I(Y; Y)}$$

or a geometric mean

$$\sqrt{\frac{I(X; Y) \times I(Y; X)}{I(X; X) \times I(Y; Y)}}.$$

In which case the measures could be interpreted as
a kind of average reduction in information. The denominators
above represent the "total information"
in the joint distribution.

3. EXAMPLES

We provide three examples that fall naturally into
the proposed framework. The first two, reflecting
prediction and statistical efficiency, are common in
statistics and actually constitute a large class of examples.
The third example, entropy, is more common
in information theory, but fits nicely into this
framework as well.

Prediction

One of the most common uses of dependence
is in predicting, or explaining the
variability of, a particular random variable. For such
a goal, we can start by building an information link
function that quantifies how knowing the value of
one variable increases the ability to predict another.
As quantifying predictive capability depends heavily
on how one measures loss, we keep the setting
fairly general.

Let $g$ be a nonnegative penalty function such that
$g(0) = 0$; for example, $g(x) = x^2$ would yield the
usual $L^2$ prediction and $g(x) = |x|$ the usual $L^1$ prediction.
We start by defining an optimal predictor
of Y based on X. Since we will restrict our measure
to $\mathcal{F}_1 \times \mathcal{F}_2$, we only consider predictors of Y that
are contained in $\mathcal{F}_1$. We assume that $\mathcal{F}_1$ at least
contains all of the constant values, that is, whatever
space Y is taking values in is included in $\mathcal{F}_1$. So
define, for X taking values from $\mathcal{X}$ and Y from $\mathcal{Y}$,

$$\hat{Y}(X) = \Big( \operatorname*{argmin}_{f \colon \mathcal{X} \to \mathcal{Y},\ f(X) \in \mathcal{F}_1} E[g(Y - f(X))] \Big)(X),$$

that is, we choose a function of X that best predicts
Y, but also falls into $\mathcal{F}_1$. See the Appendix for
discussion on the existence of such an estimate. We
define $\hat{Y}_0$ to be the best constant predictor of Y. We
can then quantify the increase in predictive capability
by examining the difference

$$I(X; Y) = E[g(Y - \hat{Y}_0)] - E[g(Y - \hat{Y}(X))].$$

The details showing that the above is a valid information
link function can be found in the Appendix.
The resulting measure of dependence would then be

$$D(X; Y) = \frac{I(X; Y)}{I(Y; Y)} = \frac{E[g(Y - \hat{Y}_0)] - E[g(Y - \hat{Y}(X))]}{E[g(Y - \hat{Y}_0)]},$$

which can be interpreted as either the increase in
predictive capability, 0 being no increase and 1 implying
Y is completely determined by X, or as the
proportion of "g-variability" of Y explained by X.

For example, if $g(x) = x^2$, then the optimal predictor
of Y based on X is just $E[Y \mid X]$. The measure
of dependence then becomes

$$D(X; Y) = \frac{E[(Y - E[Y])^2] - E[(Y - E[Y \mid X])^2]}{E[(Y - E[Y])^2]} = \frac{\operatorname{Var}(E[Y \mid X])}{\operatorname{Var}(Y)},$$

which is just the well-known correlation ratio. Fur- 
thermore, if we assume that the relationship be- 
tween X and Y is linear, then the above also equals 
the square of correlation. 
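
To make the role of the penalty function concrete, the sketch below computes the prediction-based measure on an empirical distribution for two choices of penalty: squared error, which gives the correlation ratio, and absolute error. It assumes a discrete X and that every function of X is an admissible predictor; the function name `prediction_dependence` and the toy data are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from statistics import fmean, median


def prediction_dependence(x, y, loss, best_constant):
    """Empirical D(X; Y) = (E[g(Y - Y0)] - E[g(Y - Yhat(X))]) / E[g(Y - Y0)]
    for discrete x, assuming every function of x is an allowed predictor."""
    y0 = best_constant(y)  # best constant predictor of y under the loss
    baseline = fmean(loss(v - y0) for v in y)
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    conditional = 0.0
    for members in groups.values():
        prediction = best_constant(members)  # optimal predictor within the group
        conditional += sum(loss(v - prediction) for v in members)
    conditional /= len(y)
    return (baseline - conditional) / baseline


x = [1, 1, 2, 2, 3, 3]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Squared error: the best predictor is the (conditional) mean,
# so the result is the empirical correlation ratio.
d_squared = prediction_dependence(x, y, lambda e: e * e, fmean)

# Absolute error: the best predictor is a (conditional) median.
d_absolute = prediction_dependence(x, y, abs, median)
```

The two penalties give different values on the same data, which illustrates why the interpretation of the measure is tied to the loss the practitioner actually cares about.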

Statistical Efficiency 

Another common statistical setting concerns what 
we call statistical proxies. Such objects arise when 
there is an "optimal" data set for inference on a par- 
ticular parameter which is, for whatever reason, not 
available and one has to use either the optimal set 
with missing or coarsened values, or possibly a com- 
pletely different data set intended to be a substitute 
for the optimal one. Thus, we call the observed data 
set a statistical proxy for the parameter of interest 
if the information it contains about the parameter 
is redundant when the "optimal" set is also known. 
So missing data and coarse data are special exam- 
ples when the observed data set contains only redun- 
dant information (as compared to the complete data 
set) on the parameter of interest. While observed 
variables are usually statistical proxies for the com- 
plete ones, there are examples that are not based 
on missing data from an optimal set. These include 
censored data, rounded measurements and examples 
where variables of interest cannot be observed and 
are replaced in inference by correlated variables. An 
example, discussed in more detail at the end of this 
section, is that of genetic association studies where 
causal variants are detected by testing well-selected 
genetic markers. 

So we wish to develop a measure of dependence 
when the goal is to quantify how effective a statisti- 
cal proxy one variable is for another. The measures 
of dependence need to be tailored by the type of 
statistical inference that is performed. We start by 
defining, mathematically, what we mean by a statis- 
tical proxy for a parametric model. 

Definition 2. We say that X is a statistical
proxy of Y for a parameter $\theta$ if Y is sufficient for
$\theta$ with respect to the joint distribution $\mathcal{L}(X, Y; \theta)$,
that is, $\mathcal{L}(X \mid Y; \theta)$ is almost surely constant with
respect to $\theta$.

This definition implies that X contains only redundant
information for $\theta$ if Y is known, though it
gives no indication as to the merits of X as a proxy.
For example, in a missing data problem, this is obviously
true as long as the missing data mechanism
does not depend on $\theta$; however, all that is required
is that the missing data contains only redundant information
for $\theta$ when the full data set is observed,
which is a fairly mild assumption in most settings.

Performance of most likelihood based methods for 
estimation and hypothesis testing can be evaluated, 
at least asymptotically, by the Fisher information. 
For example, the asymptotic variance of the max- 
imum likelihood estimator is monotonically related 
to the Fisher information, and so is the noncentral- 
ity parameter that drives the power in the likelihood 
ratio test. So let $\mathcal{F}_2 = \{Y\}$ and $\mathcal{F}_1$ be all proxies of
Y. If the goal is to analyze the efficiency in using X
for inference on $\theta$, then a natural measure of information
is the Fisher information for $\theta$ based on X:

$$I(X; Y) = \mathcal{I}_X(\theta).$$

The details showing that the above is a valid infor- 
mation link function can be found in the Appendix. 
Since X is a proxy for Y, it can easily be shown that
$\mathcal{I}_X(\theta) \le \mathcal{I}_Y(\theta)$. Our measure becomes

$$D(X; Y) = \frac{\mathcal{I}_X(\theta)}{\mathcal{I}_Y(\theta)},$$

which for estimation can be interpreted as the increase
in variability of the estimate or the decrease
in accuracy when using X in place of Y. For hypothesis
testing it can be interpreted as the loss in
power. In both cases, the measure indicates the relative
efficiency in inference about $\theta$ when using X
compared to Y. It is important to note that in the
context of missing data, the above measure is also 
closely related to the rate of convergence for the EM 
algorithm (see Dempster, Laird and Rubin (1977), 
for details). 

As a very simple example, consider the case where
Y is normal with mean $\mu$ and variance $\sigma^2$, and we
are interested in estimating $\mu$. However, suppose we
only observe X which is Y with probability p and
missing (coded as 0) with probability $1 - p$. Put another
way, $X = YZ$, where Z is Bernoulli with parameter
p and is independent of Y. Such a setting
is usually called missing completely at random or
MCAR, and the missing data mechanism is free of
the parameter of interest $\mu$. While we could technically
compute the correlation, coding a missing
value as 0 was completely arbitrary; thus, our measure
should not depend on that choice. As can be
found in most graduate level statistics textbooks,
the Fisher information for $\mu$ with respect to Y is
$1/\sigma^2$. The Fisher information for $\mu$ with respect to
X is given by $p/\sigma^2$ (see the Appendix for details).
Thus, in this special case, the dependence measure
becomes

$$D(X; Y) = p.$$

This is the well-known expected proportion of observed
values, but it also represents the relative efficiency
in estimating $\mu$ when using the X observations
in place of Y observations.
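
The MCAR calculation can be checked numerically with a small Monte Carlo sketch. It approximates each Fisher information by the average squared score: the score of a fully observed normal draw is $(Y - \mu)/\sigma^2$, while a missing value contributes a score of zero because the missingness probability is free of $\mu$. The function name, seed and parameter values below are illustrative assumptions, not taken from the paper.

```python
import random


def fisher_information_ratio(p, mu, sigma, n, seed=0):
    """Monte Carlo estimate of I_X(mu) / I_Y(mu) under MCAR with
    observation probability p, using averaged squared scores."""
    rng = random.Random(seed)
    info_y = 0.0  # accumulates squared scores when Y is always observed
    info_x = 0.0  # accumulates squared scores when Y is observed w.p. p
    for _ in range(n):
        y = rng.gauss(mu, sigma)
        observed = rng.random() < p
        score_y = (y - mu) / sigma ** 2
        # A missing value carries no information about mu: its score is zero.
        score_x = score_y if observed else 0.0
        info_y += score_y ** 2
        info_x += score_x ** 2
    return info_x / info_y


ratio = fisher_information_ratio(p=0.3, mu=1.0, sigma=2.0, n=200_000)
# ratio should be close to p = 0.3, in line with D(X; Y) = p
```

Note that the simulation never needs a numeric code for a missing value, which mirrors the point that the measure should not depend on how missingness is coded.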

There are situations where the main interest is 
in hypothesis testing, and useful information link 
functions need to reflect easily interpretable metrics 
such as the sample size necessary for achieving some 
given type 1 and type 2 error rates. For example, in 
genome-wide association studies, we only have data 
for single nucleotide polymorphisms (SNPs) avail- 
able on a particular genotyping array. Thus, if a SNP 
is causal for a particular disease, but not on the ar- 
ray, its signal could potentially be missed. However, 
if there is an arrayed SNP highly correlated (called 
in linkage disequilibrium or LD) with the causal 
SNP, then it can be used as a proxy for the causal 
variant. The design of a genetic association study 
takes advantage of the dependence between SNPs 
and of knowledge on how this dependence affects the 
power of detecting associations. This knowledge is 
quantified in a measure of dependence/LD, denoted
by $r^2$, that has a clear interpretation: it is approximately
equal to the ratio of sample sizes that leads
to the same power when using the causal versus the
genotyped SNP. For example, suppose we would like
to identify, in a candidate gene study, if there exists
a causal variant with a given effect size. We can easily
perform a power calculation for a causal SNP
that can specify the sample size, $n_1$, needed for detecting
it. Available are $n_2$ samples (with $n_2 > n_1$),
and the interpretability of $r^2$ can help us in selecting
an optimal genotyping design: choose the minimum
set of SNPs such that, for all SNPs in the gene,
there exists one in this set with pairwise $r^2 > n_1/n_2$.
Note that measures based on the idea of asymptotic
relative efficiency (ARE) are not the only way one
could design interpretable functions in hypothesis
testing. For example, one can use elements in the
distribution of the likelihood ratio statistic to quantify
impact on power (see Nicolae, Meng and Kong
(2008); Reimherr and Nicolae (2011)).
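
The genotyping design rule can be sketched in a few lines of Python. The code below uses a greedy set-cover heuristic, which is a swapped-in simplification: it returns a small covering set but is not guaranteed to find the minimum one. The function name `greedy_tag_snps`, the pairwise LD matrix and the sample sizes are all hypothetical values chosen for illustration.

```python
def greedy_tag_snps(r2, threshold):
    """Greedily pick SNP indices so that every SNP has pairwise r^2 above
    `threshold` with at least one picked SNP. `r2` is a symmetric matrix
    (list of lists) with ones on the diagonal."""
    n = len(r2)
    uncovered = set(range(n))
    tags = []
    while uncovered:
        # Choose the SNP that covers the most still-uncovered SNPs.
        best = max(
            range(n),
            key=lambda s: sum(1 for t in uncovered if r2[s][t] > threshold),
        )
        tags.append(best)
        uncovered -= {t for t in uncovered if r2[best][t] > threshold}
    return tags


n1, n2 = 800, 1000          # required and available sample sizes (made up)
threshold = n1 / n2         # pairwise r^2 must exceed n1 / n2

r2 = [
    [1.00, 0.90, 0.20, 0.10],
    [0.90, 1.00, 0.85, 0.10],
    [0.20, 0.85, 1.00, 0.30],
    [0.10, 0.10, 0.30, 1.00],
]

tags = greedy_tag_snps(r2, threshold)
```

Because $n_2 > n_1$ the threshold is below one, so each SNP always covers itself and the loop terminates. In this toy matrix the second SNP tags the first three, and the fourth must be genotyped directly.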

The idea of sample size as a measure of information
for exchangeable (e.g., i.i.d.) data could be a
powerful tool for translating attributes of joint distributions
to applied scientists. There are many situations
where the interest is in observable claims
(length of confidence intervals, precision of estimation,
power of a statistical test, etc.) on the marginal
distribution of Y, that is, in objects that can be
calculated from the distribution function of Y. For
example, we could be interested in a quantile of Y
(the percentage of households in a city with annual
income larger than $250K) and we would like to express
that with a narrow confidence interval (length
smaller than one percent). Using information on the
distribution of Y (e.g., national data for income), we
can predict the sample size necessary for the needed
claim, the issue being that we could collect data only
for a proxy, X (such as the tax rate for the household).
Obviously, we need a larger sample size to
obtain the same width for a confidence interval, and
the ratio of these two sample sizes offers an easily
interpretable measure of dependence.

Entropy 

Entropy is widely used in the information the- 
ory literature as the primary measure of information 
content in a random variable (see, e.g., Cover and
Thomas (2006)) and arises in the statistics literature
as the expected value of the log likelihood. The
interpretability of the entropy is a bit questionable
except in some very specific circumstances, but
it is nevertheless very popular in the field of infor- 
mation theory. Thus, it may be reasonable that a 
practitioner would choose entropy as the measure of 
information they are concerned with, and, in partic- 
ular, how knowing the value of one variable reduces 
the entropy in another. 

The entropy of a random variable Y (for simplicity,
we assume that Y is discrete) with probability
mass function $f_Y$ is defined as

$$H(Y) = -E[\log(f_Y(Y))].$$

And the conditional entropy of Y given X is

$$H(Y \mid X) = -E[E[\log(f_{Y \mid X}(Y \mid X)) \mid X]],$$

which indicates the entropy of Y given X, averaged
over X. Thus, using the reduction in entropy of Y
by knowing X as our measure of information gives

$$I(X; Y) = H(Y) - H(Y \mid X),$$

which is commonly called the mutual information
in Y about X. The details showing that the above
is a valid information link function can be found in
Cover and Thomas (2006). This yields the dependence
measure

$$D(X; Y) = \frac{H(Y) - H(Y \mid X)}{H(Y)},$$

which is easily interpreted as the proportional reduction
in entropy of Y by knowing X (see also
Ebrahimi, Soofi and Soyer (2010)). It is perhaps interesting
to note that while the mutual information
is symmetric, the above dependence measure is not.
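
The asymmetry of the entropy-based measure is easy to see on a small discrete example. The sketch below computes the proportional reduction in entropy from a joint probability table; the function names and the toy distribution (X uniform on four values, Y the parity of X) are illustrative assumptions rather than anything taken from the paper.

```python
from math import log


def entropy(pmf):
    """Shannon entropy (natural log) of a probability mass function."""
    return -sum(p * log(p) for p in pmf if p > 0)


def entropy_dependence(joint):
    """D(X; Y) = (H(Y) - H(Y | X)) / H(Y), where joint[i][j] = P(X=i, Y=j)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    h_y = entropy(p_y)
    # Conditional entropy: entropy of Y within each X-row, averaged over X.
    h_y_given_x = sum(
        px * entropy([p / px for p in row])
        for row, px in zip(joint, p_x)
        if px > 0
    )
    return (h_y - h_y_given_x) / h_y


# X is uniform on {0, 1, 2, 3} and Y = X mod 2.
joint_xy = [
    [0.25, 0.00],
    [0.00, 0.25],
    [0.25, 0.00],
    [0.00, 0.25],
]
joint_yx = [list(col) for col in zip(*joint_xy)]  # swap the roles of X and Y

d_xy = entropy_dependence(joint_xy)  # X determines Y completely
d_yx = entropy_dependence(joint_yx)  # Y removes half of the entropy of X
```

Here the mutual information is the same in both directions, but dividing by the entropy of the second argument gives a value of one when predicting Y from X and one half when predicting X from Y.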

4. APPLICATIONS 

In this section we discuss applications to income 
and education, and geomagnetic storms. While these 
examples are more focused on prediction, additional 
examples involving statistical efficiency and miss- 
ing information in genetic association studies can 
be found in Nicolae (2006), Nicolae, Meng and Kong 
(2008) and Reimherr and Nicolae (2011). 

Income and Education 

One of the driving motivators of pursuing education 
is the potential for higher income. A vast amount 
of data and reports exploring their relationship can 
be found on the website of the Bureau of Labor 
Statistics (www.bls.gov). In this example we will 
explore the differences between men and women in 
terms of how their incomes are affected by educa- 
tion levels. Furthermore, we will demonstrate how 
the choice of dependence measure can play a crucial 
role in understanding and communicating that dif- 
ference. The data we explore here consists of approx- 
imately 1.25 million individuals living in the U.S. 
aged 25 and over and receiving an annual income 
(see http://factfinder.census.gov/home/en/acs_pums_2009_1yr.html 
for further details). 

How to choose a dependence measure for such a 
setting is not completely obvious. Classical choices 
include correlation (assuming education is measured 
quantitatively) and the correlation ratio in conjunc- 
tion with a generalized linear model. However, indi- 
viduals often strive to hit certain income thresholds. 
At low levels of income, individuals may try to make 
it out of poverty or above minimum wage. Thus, a 
very meaningful question would be, how does edu- 
cation affect the chances of making it past a certain 
income threshold? For this example, we will use a 
threshold of $35,000, the approximate median income 
of U.S. adults over 25 years of age (see U.S. 
Census Bureau (2010)). 

Let Y be an indicator variable equaling 1 if an 
individual's personal income is over $35,000 and 0 
otherwise. Let X be the education level of that 
individual, taking the values 0, 1, 2 and 3 to represent 
education levels of "less than high school," "high school 
degree or equivalent," "bachelor's or associate's degree" and 
"higher degree," respectively. See Efron (1978) for 
a discussion on dependence measures for binary 
response variables. While the literature on binary data 
is quite large, we also cite Goodman and Kruskal 
(1979), McCullagh and Nelder (1989), Lipsitz, Laird 
and Harrington (1991) and Liang, Zeger and Qaqish 
(1992) and the references therein. The last two 
references deal especially with odds ratios, which we 
have not touched on here. Here we will compare 3 
different measures of dependence: the correlation 
ratio (sometimes called Efron's $R^2$ in this setting), the 
ratio of reduction in deviance and the ratio of 
reduction in 0-1 prediction error. The first measure might 
be considered the most natural generalization of the 
$R^2$ from linear regression and gives the reduction 
in $L^2$ prediction error. The second measure is 
commonly used in the theory of generalized linear 
models (see McCullagh and Nelder (1989)). The third 
measure is natural because Y takes discrete values, 
so we can ask what the probability is that we 
incorrectly predict Y. As an aside, the third measure 
can also be viewed as the reduction in $L^1$ 
prediction error, as that measure will coincide with $D_3$ in 
this case. Let $Y_1,\ldots,Y_n$ be the binary incomes for 
the n individuals in the data set, and let $X_1,\ldots,X_n$ 
be their corresponding education levels. Define two 
fitted values as $\hat{Y}_i = E[Y_i|X_i]$ and $\tilde{Y}_i = I\{\hat{Y}_i > 0.5\}$. 
Also define $\bar{Y} = n^{-1}\sum Y_i$ and $\tilde{Y} = I\{\bar{Y} > 0.5\}$, which 
correspond to the unconditioned fitted values. Then 
the measures of dependence can be expressed as 



correlation ratio: 

$$D_1(X,Y) = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2},$$

deviance ratio: 

$$D_2(X,Y) = 1 - \frac{\sum_i [Y_i\log(Y_i/\hat{Y}_i) + (1-Y_i)\log((1-Y_i)/(1-\hat{Y}_i))]}{\sum_i [Y_i\log(Y_i/\bar{Y}) + (1-Y_i)\log((1-Y_i)/(1-\bar{Y}))]},$$

0-1 ratio: 

$$D_3(X,Y) = 1 - \frac{\sum_i 1\{Y_i \neq \tilde{Y}_i\}}{\sum_i 1\{Y_i \neq \tilde{Y}\}}.$$

Table 1 

Income and education measures of dependence. $D_1$, $D_2$ and 
$D_3$ correspond to the correlation ratio, the deviance ratio and 
the 0-1 loss ratio, respectively 

Gender    D_1      D_2      D_3 
Male      0.1092   0.0843   0.1171 
Female    0.1522   0.1183   0.2329 

Table 2 

The probabilities of making more than the median 
income given the education level X for males and females. 
Note p(y|x) = P(Y = y|X = x) 

Gender    p(1|0)   p(1|1)   p(1|2)   p(1|3)   P(Y = 1) 
Male      0.2719   0.5310   0.7404   0.8351   0.6046 
Female    0.0836   0.2761   0.5462   0.7445   0.4140 



Notice that in this context, the measure $D_3$ has a 
very useful and relevant interpretation. Since we are 
looking at an income threshold of $35,000, we are 
especially interested in how well education predicts 
being over or under that threshold. $D_3$ gives a literal 
measure of reduction in prediction error that most 
people can understand: if $D_3 = 0.30$, then knowing 
a person's education level decreases the chances of 
incorrectly predicting them above/below $35,000 by 
30% (as compared to using the population average). 
The measure $D_2$, on the other hand, is very 
difficult to interpret. It gives a reduction in a log type 
penalty, but it is difficult to give it much more of an 
interpretation (although statisticians can 
view it as a reduction in the expected log likelihood). 
The measure $D_1$ has a bit more of an interpretation 
and closely resembles the regression $R^2$, but 
considering the discrete nature of Y, it is difficult to 
explain why one should be especially interested in 
an $L^2$ type loss. 
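
The three ratios discussed above can be computed directly from paired observations. The following is a minimal sketch, assuming the fitted values are the within-group means of Y (a saturated model in X) and that both outcomes occur in the data; the function name and the toy data are ours and are not taken from the income data set.

```python
import math
from collections import defaultdict

def binary_dependence_measures(x, y):
    """Return (D1, D2, D3) for a binary response y and a categorical predictor x.

    Fitted values are within-group means of y; this is an assumption of the sketch.
    """
    n = len(y)
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    group_mean = {g: sum(v) / len(v) for g, v in groups.items()}
    y_hat = [group_mean[xi] for xi in x]  # conditional fitted probabilities
    y_bar = sum(y) / n                    # unconditional fitted probability

    def dev(yi, mu):
        # Bernoulli deviance term; the 0 * log(0) part is treated as 0.
        return -(math.log(mu) if yi == 1 else math.log(1 - mu))

    def miss(yi, mu):
        # 0-1 loss of the thresholded prediction I{mu > 0.5}.
        return int(yi != (1 if mu > 0.5 else 0))

    d1 = 1 - (sum((yi - m) ** 2 for yi, m in zip(y, y_hat))
              / sum((yi - y_bar) ** 2 for yi in y))
    d2 = 1 - (sum(dev(yi, m) for yi, m in zip(y, y_hat))
              / sum(dev(yi, y_bar) for yi in y))
    d3 = 1 - (sum(miss(yi, m) for yi, m in zip(y, y_hat))
              / sum(miss(yi, y_bar) for yi in y))
    return d1, d2, d3

# Toy data: two education groups with success rates 1/4 and 3/4.
x = [0] * 4 + [1] * 4
y = [0, 0, 0, 1, 0, 1, 1, 1]
d1, d2, d3 = binary_dependence_measures(x, y)
```

Even in this toy example the three measures disagree in magnitude, which is the point made above: the choice among them should follow from the interpretation one needs.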

The fitted values are computed using the full model 
and we compare the different measures in males 
and females. The results are summarized in Table 1. 
Note that even the smallest gender/income/education 
group has over 80,000 individuals, making all model 
estimation error essentially negligible. As we can see 
from the table, each measure differs across gender; 
however, the magnitudes of the differences are quite 
different. While $D_1$ and $D_2$ are approximately 40% 
higher for females than males, $D_3$ is almost 100% 
higher. 

To further understand these relationships, consider 
the conditional probabilities P(Y = 1|X) for 
different levels of education as given in Table 2. The 
difference between men and women is fairly remarkable, 
especially at lower education levels. For those 
with less than a high school degree (X = 0), men 
are more than twice as likely as women to make 
more than the median income. This trend levels off 
at higher education levels, as among those with a 
graduate level degree (X = 3), men are only 12% 
more likely than women to make more than the me- 
dian income. 
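
The comparisons in this paragraph follow from simple ratios of the Table 2 entries. As a quick check, the sketch below recomputes the male-to-female ratio of P(Y = 1|X = x) at each education level; the variable names are ours.

```python
# Conditional probabilities P(Y = 1 | X = x) from Table 2, ordered x = 0, 1, 2, 3.
male = [0.2719, 0.5310, 0.7404, 0.8351]
female = [0.0836, 0.2761, 0.5462, 0.7445]

# Ratio of male to female probability of exceeding the income threshold.
ratios = [m / f for m, f in zip(male, female)]
for level, r in enumerate(ratios):
    print(f"X = {level}: male/female ratio = {r:.2f}")
```

The ratio shrinks steadily with education level, from above 3 at X = 0 to about 1.12 at X = 3.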

The difference between men and women in regard 
to the income/education relationship is remarkable. 
The measure of dependence $D_3$ picks up 
this difference more clearly than $D_1$ and $D_2$ and has 
a very relevant interpretation as the reduction in 
literal prediction error. After seeing the conditional 
probabilities in Table 2, it is easy to understand why 
education plays a larger role for women than for men; 
by attaining higher levels of education, women are 
able to significantly narrow this income gap relative 
to men. 

Geomagnetic Storms 

Here we present an example of how our methodol- 
ogy can help guide the development of new measures 
of dependence. The magnetosphere of the earth forms 
part of the exosphere, the earth's atmosphere's out- 
ermost layer. Solar wind emitted by the Sun is di- 
rected around the earth by the magnetosphere, but 
the interaction of the two generates a tremendous 
amount of electrical current and electromagnetic ac- 
tivity. Solar flares can generate strong geomagnetic 
substorms, an example of which is the Aurora Bore- 
alis. Particles from the solar wind make it to the in- 
nermost layer of the magnetosphere, called the Iono- 
sphere, and ionize the gases, causing an amazing 
display of light. Substorms typically last one or two 
days and can be very disruptive to global position- 
ing systems and radio and radar technologies that 
bounce their signals off the ionosphere, as well as 
outright damaging satellites, power grids and data 
storage technologies. This topic has gained a great 
deal of attention recently, as we are approaching the 
peak of the solar magnetic activity cycle. 

Understanding the nature of these storms is an 
important goal, but one made difficult by the fact 
that the magnetosphere is too low for satellites, but 
too high for aircraft. To that end, INTERMAGNET 
is a network of terrestrial observatories that monitor 
electromagnetic activity in the magnetosphere 
and attempt to provide almost real time data on 
the geomagnetic activity at their location. A large 
scale analysis of their data is far beyond the scope of 
this paper. Instead, we focus on measurements taken 
in College (coded as CMO), Alaska and Honolulu 
(coded as HON), Hawaii in 2001. The data consists 
of 120 days where storms occurred, taken from Jan- 
uary through September. We further separate the 
data into three sets of 40 pairs. The first set consists 
of storms in January through March, the second set 
April through June, and the third set July through 
September. On each day, 1440 equally spaced mea- 
surements are given, making more traditional meth- 
ods difficult to apply. Each value is measured in 
nanoteslas and only indicates the strength of the 
horizontal component of the magnetic field. Time is 
measured in terms of Universal Time (UT). We view 
each day as a single functional observation because 
of the daily rotation of the earth. By "functional 
observation" we mean that we treat the curve from 
a particular day as an observation from a random 
function taking values in a function space. Such an 
approach is commonly called functional data analy- 
sis or FDA for short. 

A long term goal would be the ability to predict 
future storm activity at different locations on the 
globe, using data from other stations. For example, 
since the substorms are driven by the sun, a station's 
storm activity usually dies out at night. Thus, 
we may potentially use stations currently facing the 
sun to predict the next day's storm activity for stations 
currently facing away from the sun. So we 
build measures based on prediction, that is, how 
well we can predict the storm activity in one station 
(HON) given that we observe the activity at another 
(CMO). A larger analysis would use data from mul- 
tiple stations as well as taking care with the differing 
time zones (CMO is only two hours ahead of HON, 
so one really cannot use an entire CMO day to predict 
a HON day), but our approach will be sufficient 
to illustrate our dependence framework. 

We assume that Y and X are random functions 
taking values from $L^2[0,1]$, representing the entire 
curve of values measured in HON and CMO, respectively, 
on a particular day. As was said, the information 
we are concerned with is $L^2$ prediction. Thus, 
we can take the reduction in $L^2$ prediction error as our 
measure of information 

$$I(X;Y) = E\|Y - E[Y]\|^2 - E\|Y - E[Y|X]\|^2.$$



We should note that above $Y$, $X$, $E[Y]$ and $E[Y|X]$ 
are all functions and $\|\cdot\|$ is the $L^2$ functional norm 
(the domain depends on how you parametrize time 
over a day, but is usually taken to be [0, 1] for 
simplicity). And we arrive at a kind of functional version 
of the correlation ratio 

$$D_1(X;Y) = \frac{E\|Y - E[Y]\|^2 - E\|Y - E[Y|X]\|^2}{E\|Y - E[Y]\|^2}.$$

As in the univariate case, $D_1$ can be interpreted as 
explained variability or reduction in $L^2$ prediction 
error. This measure is given in Ramsay and Silverman 
(2005) in the context of a goodness-of-fit measure 
for the functional linear model (which we will use to 
estimate the measures here). 

While $D_1$ has a very nice interpretation, one possible 
concern could be that $D_1$ will be heavily 
influenced by coordinates of Y with high variability. 
This can be viewed positively in certain situations, 
as more variability means that there is more to ex- 
plain. But, as with our example here, the variability 
will naturally change depending on the time of day, 
as the magnetic activity is driven by the sun. Thus, 
we may want a measure that takes the changing vari- 
ability into account and is not as influenced by the 
more variable coordinates. With that in mind, we 
propose two additional measures of dependence: 

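$$D_2(X;Y) = \frac{E\|(Y - E[Y])/S\|^2 - E\|(Y - E[Y|X])/S\|^2}{E\|(Y - E[Y])/S\|^2} = 1 - E\left\|\frac{Y - E[Y|X]}{S}\right\|^2,$$

$$D_3(X;Y) = 1 - \frac{E[(\tilde{Y} - E[\tilde{Y}|X])^T\Sigma^{-1}(\tilde{Y} - E[\tilde{Y}|X])]}{E[(\tilde{Y} - E[\tilde{Y}])^T\Sigma^{-1}(\tilde{Y} - E[\tilde{Y}])]} = 1 - \frac{E[(\tilde{Y} - E[\tilde{Y}|X])^T\Sigma^{-1}(\tilde{Y} - E[\tilde{Y}|X])]}{d},$$
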
where $S(t) = (\mathrm{Var}(Y(t)))^{1/2}$, $\tilde{Y} \in R^d$ is the projection 
of Y onto the d most significant principal components, 
and $\Sigma$ is the variance-covariance matrix 
of $\tilde{Y}$. Here, $D_2$ can be interpreted as the explained 
variability averaged over time and it simplifies in the 
above way because the time interval is [0, 1] (otherwise 
it would be scaled by the length of the interval). 
Notice that this is an average of the coordinate-wise 
measure given in Ramsay and Silverman (2005). The 
third measure, $D_3$, which we have not seen in previous 
FDA literature, utilizes a principal component 
analysis to project the data onto a finite dimensional 
setting, denoted $\tilde{Y}$, and then computes a multivariate 
goodness-of-fit measure there. Again, the effect 
now is that the measure gives an average goodness of 
fit, but this time averaged over the principal components. 
This interpretation becomes especially clear 
when one takes into account that $\Sigma$ is in fact a 
diagonal matrix. Commonly, one chooses the number 
of principal components so that a large percentage 
of the variability, say, 85-95%, is explained by the 
PCs. When choosing the number of PCs for X one 
could also use cross-validation in terms of predicting 
Y. In our examples we always choose the 
number of PCs such that 85% of the variability is 
explained. Both $D_2$ and $D_3$ attempt to "average 
out" larger components that might otherwise dominate 
the measures, but they do it in very different 
ways. The measure $D_2$ averages the dependence over 
time, thus smoothing out more variable time periods, 
while $D_3$ averages over the components so that 
larger components do not completely dominate the 
measure. The appropriateness of the measures will 
depend on the setting, though in most cases $D_1$ will 
be very natural. 

Table 3 

The FDA measures of dependence for magnetic storm data 
evaluated over different seasons. Here X represents 
Honolulu, Hawaii and Y represents College, Alaska 

Period      Jan.-March   April-June   July-Sept. 
D_1(Y,X)    0.748        0.576        0.594 
D_2(Y,X)    0.597        0.569        0.453 
D_3(Y,X)    0.511        0.582        0.427 
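
To make the contrast between the three functional measures concrete, the following sketch computes sample versions of them for curves observed on a common equally spaced grid of [0, 1]. It assumes the fitted curves (standing in for E[Y|X]) are already available from some model, and that the residual principal component scores and eigenvalues needed for the third measure have been computed elsewhere; the function names and toy numbers are ours.

```python
def d1_d2(Y, Y_fit):
    """Sample D1 and D2 from curves stored as lists of grid values.

    Y[i][t] is curve i at grid point t; Y_fit[i][t] is the fitted curve.
    """
    n, T = len(Y), len(Y[0])
    mean = [sum(Y[i][t] for i in range(n)) / n for t in range(T)]
    # Pointwise total and residual mean squares.
    tot = [sum((Y[i][t] - mean[t]) ** 2 for i in range(n)) / n for t in range(T)]
    res = [sum((Y[i][t] - Y_fit[i][t]) ** 2 for i in range(n)) / n for t in range(T)]
    d1 = 1 - sum(res) / sum(tot)                       # integrated, unweighted
    d2 = 1 - sum(r / s for r, s in zip(res, tot)) / T  # standardized by S(t)^2, averaged over time
    return d1, d2

def d3(resid_scores, eigenvalues):
    """Sample D3 from residual PC scores (n x d) and the d leading eigenvalues."""
    n, d = len(resid_scores), len(eigenvalues)
    ratios = [sum(row[j] ** 2 for row in resid_scores) / n / eigenvalues[j]
              for j in range(d)]
    return 1 - sum(ratios) / d

# Toy curves on a two-point grid: low variance at the first point, high at the second.
Y = [[1, 10], [-1, -10], [1, -10], [-1, 10]]
# Hypothetical fit that explains the first coordinate perfectly and the second not at all.
Y_fit = [[1, 0], [-1, 0], [1, 0], [-1, 0]]
d1_val, d2_val = d1_d2(Y, Y_fit)

# Matching PC view: component 1 (variance 100) unexplained, component 2 (variance 1) explained.
d3_val = d3([[10, 0], [-10, 0], [-10, 0], [10, 0]], [100, 1])
```

Here the highly variable coordinate dominates the first measure, which is close to zero, while the second and third measures both report that half of the standardized variability is explained.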

Table 3 gives estimates of the dependence mea- 
sures over the three different seasons. The story 
changes slightly depending on the measure, which 
makes the interpretations all the more relevant. The 
first measure is strongest in January through March 
and weaker in the other two seasons. The second 
measure decreases with each season, with a much 
larger drop moving from the second to third season. 
The third measure indicates that the dependence 
is actually strongest in April through June, though it 
agrees with $D_2$ in that the July through September 
dependence is weakest. Since $D_3$ averages over 
principal components, the implication would be that 
the fit for the first component is much better in the 
January through March storms (since $D_1$ is so much 
larger for that season) as compared to the other two 
seasons, while the second season has a better fit for 
the later components. 

Our findings go a step beyond ranking of the 
strength of the dependence across seasons. Since 
our dependence measures were built upon a measure 
of information, each number in Table 3 also 
has deeper meaning beyond ordering. Here the first 
measure can be interpreted as the percentage of 
variability in one station explained by observing an- 
other. For example, in January through March, al- 
most 75% of the variability in a Hawaiian storm is 
accounted for by observing the corresponding storm 
in Alaska. What this means for scientists is that, 
after taking into account a storm in Alaska, only 
25% of the energy in the Hawaiian storm is still un- 
predictable or unaccounted for. The second measure 
gives the average "over time variability" of one sta- 
tion explained by another, while the third measure 
gives the average "over principal components" vari- 
ability explained. 

5. DISCUSSION 

In this paper we present a framework for devel- 
oping and analyzing measures of dependence when 
the goal is to explore and summarize the relation- 
ship between two variables. The framework consists 
of just a few guidelines and an information-based 
methodology designed to achieve those guidelines. 
We demonstrated how many well-known measures 
fit into this framework and presented two real data 
examples that demonstrated two distinct ways in 
which this methodology could be used. The first 
example was based on income and education and 
showed how the context of the problem and the 
goals of the researcher should dictate the chosen de- 
pendence measure. The second example developed 
a new measure of dependence for functional data. 
The measure was developed to ensure that it was 
not only interpretable and informative, but that the 
information it conveyed was highly relevant to the 
context of the problem. 

The present work is in the same vein as the work 
of Goodman and Kruskal, but we go about it in 
markedly different ways. While they focus on mea- 
sures with probabilistic interpretations, we exploit 
more general measures of information. In both cases, 
though, the goal is to develop measures with useful 
and relevant interpretations. A significant motivator 
for the present work was the development of tools 
such as the distance correlation (see, e.g., Szekely, 
Rizzo and Bakirov (2007); Szekely and Rizzo (2009)), 
the maximal information coefficient (Reshef et al. 
(2011)) and copula-based measures (see, e.g., Schwei- 
zer and Wolff, (1981); Siburg and Stoimenov (2010)). 
Such measures provide interesting and powerful meth- 
ods for detecting nonlinear dependence, but are very 
difficult to interpret. For example, saying that the 
correlation between two variables is 0.5 can have 
numerous useful interpretations, some of which are 
classical (explained linear variability) and some of 
which are less standard (Reimherr and Nicolae 
(2011)). However, knowing that the distance cor- 
relation or the MIC between two variables is 0.5 
currently gives us relatively little insight into the 
relationship between two variables. 

The link between dependence measures and types 
of information will hopefully help open up an array 
of different types of dependence measures for any 
given setting. While researchers can always choose 
more classical measures based on prediction or en- 
tropy, it is important for them to know that alter- 
native measures with distinctly different meanings 
are also available or can be constructed. For exam- 
ple, in the case of missing information, dependence 
measures related to the fraction of missing informa- 
tion can be constructed along the lines of our simple 
example. In the case of hypothesis testing, measures 
related to the relative efficiency of tests based on two 
different variables can be constructed. Such a measure 
can be interpreted as the ratio of sample sizes 
that yield the same statistical power, a very attrac- 
tive interpretation. Hopefully, more work will follow 
that shows how other important quantities can be 
used to construct new, nonstandard, yet very infor- 
mative measures of dependence. 

It is worth noting that the present work focuses al- 
most exclusively on developing theoretical measures 
based on joint distributions. The issue of estima- 
tion is left almost completely untouched, as that in 
and of itself is a fairly complex problem. Moment 
estimators can easily be used to estimate measures 
such as correlation; however, traditional estimation 
of the correlation ratio or entropy-based measures 
requires an assumption about the joint distribution 
of the two variables, which can be a nontrivial problem. 
Nonparametric estimation of the correlation ratio 
can be found in Doksum and Samarov (1995); 
however, it is unclear whether nonparametric 
estimates of more general information 
functions can be developed, or if estimation must 
be done on a case-by-case basis. Clearly, one should 
take great care when attempting to apply a measure 
whose estimation and inferential properties are not 
well established. 

We have also not discussed the important concept 
of conditional dependence measures, which would 
be useful, for example, when one has a fair amount 
of collinearity between explanatory variables. We 
believe one could adjust the current framework to 
handle conditional dependence measures by adjusting 
the spaces $\mathcal{F}_1$ and $\mathcal{F}_2$ to be conditional random 
variables. Of course, practically one needs a mea- 
sure where one can easily take a conditional expec- 
tation or be able to work with conditional distribu- 
tions. Such a step can be accomplished with nearly 
all measures presented here. However, given the im- 
portance of the problem, we refrain from exploring 
the issue further presently. 

Our hope is that the current work will start a dis- 
cussion on measures of dependence. Whether it be 
in new research or in the classroom, we believe the 
interpretation of a measure should always be empha- 
sized. It allows researchers to better determine the 
relevance of a measure to their analysis, giving clear 
interpretations to help cultivate their conclusions, 
as well as providing intuition and understanding to 
students and nonstatisticians. 

APPENDIX 

Prediction 

Here we show that the information link functions 
given in the prediction section satisfy the assumptions 
in Definition 1. Assume that X takes values 
from a set $\mathcal{X}$ and Y from $\mathcal{Y}$. Furthermore, assume 
both sets are separable Banach spaces. Let $\mu_X$, $\mu_Y$ 
and $\mu_{X,Y}$ be the probability measures induced by X, 
Y and (X, Y), respectively. By definition, property 
3 for information link functions is satisfied. Since all 
constants are included in $\mathcal{F}_1$, $I(X;Y)$ is nonnegative and 
property 1 is satisfied. Since Y predicts itself perfectly 
and we assume g(0) = 0, the first part of 
property 2 is satisfied. To see that $I(X;Y)$ is zero 
when X and Y are independent, we start by showing 
that any predictor based on X cannot do better 
than $\hat{Y}_0$. Consider, for any f such that $f(X) \in \mathcal{F}_1$, 

$$E[g(Y - f(X))] = \int_{\mathcal{X}\times\mathcal{Y}} g(y - f(x))\,d\mu_{X,Y}(x,y).$$

Since g is nonnegative (we of course have to assume 
g is measurable as well), the integral exists and by 
Fubini's theorem, when X and Y are independent, 
equals 

$$E[g(Y - f(X))] = \int_{\mathcal{X}}\left(\int_{\mathcal{Y}} g(y - f(x))\,d\mu_Y(y)\right)d\mu_X(x) = \int_{\mathcal{X}} E[g(Y - f(x))]\,d\mu_X(x).$$

Since $\hat{Y}_0$ is the best constant predictor of Y, we have 
that $E[g(Y - f(x))] \geq E[g(Y - \hat{Y}_0)]$ for all $x \in \mathcal{X}$ 
and, therefore, 

$$E[g(Y - f(X))] \geq \int_{\mathcal{X}} E[g(Y - \hat{Y}_0)]\,d\mu_X(x) = E[g(Y - \hat{Y}_0)].$$

Thus, any predictor based on X cannot do better 
than $\hat{Y}_0$ and we obtain $0 \leq I(X;Y) \leq I(\hat{Y}_0;Y) = 0$ 
if X and Y are independent and, therefore, all 3 
properties are satisfied. 
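
The key step of the argument, that under independence no function of X can beat the best constant predictor, can be checked exactly on a small discrete example. The sketch below assumes a squared-error penalty and hypothetical marginal distributions, and searches over a finite grid of candidate predictions; none of these choices come from the paper.

```python
from itertools import product

p_x = {0: 0.3, 1: 0.7}          # hypothetical marginal of X
p_y = {0: 0.2, 1: 0.5, 2: 0.3}  # hypothetical marginal of Y

def g(u):
    """Squared-error penalty, with g(0) = 0."""
    return u * u

def risk(f):
    """E[g(Y - f(X))] for independent X and Y, computed from the product measure."""
    return sum(px * py * g(y - f[x])
               for (x, px), (y, py) in product(p_x.items(), p_y.items()))

grid = [k / 10 for k in range(0, 21)]  # candidate predicted values in [0, 2]

# Best constant predictor versus best predictor allowed to depend on X.
best_constant = min(risk({0: c, 1: c}) for c in grid)
best_function = min(risk({0: a, 1: b}) for a, b in product(grid, grid))
```

Both minima coincide, so the reduction in prediction error, and hence the information link function, is zero in this independent case.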

Note that the existence of the estimators $\hat{Y}$ and $\hat{Y}_0$ 
can be guaranteed by placing some requirements on 
g and $\mathcal{F}_1$. If g is a continuous convex function and 
$\mathcal{F}_1$ is a finite dimensional, closed and convex set, 
then the existence of a solution follows from standard 
convexity theory. If $\mathcal{F}_1$ is infinite dimensional, 
then one also needs that g is coercive to guarantee 
that a solution exists. For more details see any text 
on convex optimization or variational calculus (e.g., 
Gelfand and Fomin (1963); Boyd and Vandenberghe 
(2004)). 

Relative Efficiency 

Here we show that the information link functions 
given in the relative efficiency section satisfy the 
assumptions in Definition 1 and detail the calcula- 
tions for the MCAR example. Assume that the joint 
and marginal distributions are continuously differ- 
entiable. Property 1 for information link functions 
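is satisfied by definition. To establish property 2 
consider that 

$$\mathcal{I}_{X,Y}(\theta) = \mathcal{I}_{Y|X}(\theta) + \mathcal{I}_X(\theta) = \mathcal{I}_{X|Y}(\theta) + \mathcal{I}_Y(\theta).$$

Since Y is sufficient for $\theta$, $f(X|Y;\theta)$ is constant with 
respect to $\theta$ and we have 

$$\mathcal{I}_{X|Y}(\theta) = E_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X|Y;\theta)\right)^2\right] = 0.$$

Thus, $\mathcal{I}_X(\theta) \leq \mathcal{I}_Y(\theta)$, which proves $I(X;Y) \leq I(Y;Y)$. 
If X and Y are independent, then $f(X|Y;\theta) = f(X;\theta)$ 
and is constant with respect to $\theta$. Thus, 
$\mathcal{I}_X(\theta) = 0$ and the second property is established. 
The third property now follows from property 2 since 
Z is a proxy of X for $\theta$. 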

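For the MCAR example, the Fisher information 
for $\mu$ with respect to X can be computed as 

$$\mathcal{I}_X(\mu) = -E\left[\frac{\partial^2}{\partial\mu^2}\log(f(X,Z;\mu,\sigma^2,p))\right]$$
$$= -E\left[\frac{\partial^2}{\partial\mu^2}\log(f(X,0;\mu,\sigma^2,p))\,\Big|\,Z = 0\right](1-p) - E\left[\frac{\partial^2}{\partial\mu^2}\log(f(X,1;\mu,\sigma^2,p))\,\Big|\,Z = 1\right]p.$$

If Z = 0, then X = 0 with probability 1 and 

$$-E\left[\frac{\partial^2}{\partial\mu^2}\log(f(X,0;\mu,\sigma^2,p))\,\Big|\,Z = 0\right](1-p) = 0.$$

If Z = 1, then X is normal with mean $\mu$ and variance $\sigma^2$, 
which means 

$$-E\left[\frac{\partial^2}{\partial\mu^2}\log(f(X,1;\mu,\sigma^2,p))\,\Big|\,Z = 1\right]p = p/\sigma^2.$$

Income and Education 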



Here we detail how the 0-1 information link function 
from the income and education example was 
derived, and prove that it satisfies the assumptions 
in Definition 1. We can take $\mathcal{F}_2$ to be the 
set of Bernoulli random variables and $\mathcal{F}_1$ to be the 
set of all random variables and vectors. Let $\tilde{Y}$ be 
a predictor of Y based on no information (not conditioned 
on any other random variables) and $\tilde{Y}(X)$ a predictor 
based on X. If we evaluate $\tilde{Y}$ based on 0-1 loss, then 
our error is 

$$E[1_{\tilde{Y} \neq Y}] = P(\tilde{Y} \neq Y) = P(\tilde{Y} \neq 0|Y = 0)P(Y = 0) + P(\tilde{Y} \neq 1|Y = 1)P(Y = 1).$$

Since $\tilde{Y}$ is based on "no information," it must be 
independent of Y. Thus, if we let $p_y$ denote 
$P(Y = 1)$ and $q_y = P(\tilde{Y} = 1)$, then 

$$E[1_{\tilde{Y} \neq Y}] = P(\tilde{Y} \neq 0)(1 - p_y) + P(\tilde{Y} \neq 1)p_y = q_y(1 - p_y) + (1 - q_y)p_y.$$

So the best predictor will be the one that minimizes 
the above expression. Taking derivatives with respect 
to $q_y$, we obtain 

$$1 - 2p_y,$$

which is positive if $p_y < 1/2$ and negative if $p_y > 1/2$. 
So, if $p_y < 1/2$, the minimum error is achieved by 
taking $q_y = 0$ and if $p_y > 1/2$, the minimum error 
is achieved by taking $q_y = 1$. This intuitively makes 
sense; if the loss is 0-1, then make the predictor the 
outcome with the highest probability. So if we take 
$\tilde{Y} = \arg\max_y\{P(Y = y)\}$, then that predictor has 
the lowest 0-1 loss and its error is given by 

$$E[1_{\tilde{Y} \neq Y}] = \min\{P(Y = 0), P(Y = 1)\}.$$
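
The minimization above is easy to verify numerically. The sketch below evaluates the error q(1 - p) + (1 - q)p over the two endpoint choices of q and applies it to the marginal probabilities P(Y = 1) reported in Table 2; the function names are ours.

```python
def zero_one_error(q, p):
    """P(prediction != Y) when the prediction equals 1 with probability q,
    independently of Y, and P(Y = 1) = p."""
    return q * (1 - p) + (1 - q) * p

def best_error(p):
    """Minimum 0-1 error without covariates.

    The error is linear in q, so the minimum is attained at q = 0 or q = 1.
    """
    return min(zero_one_error(0, p), zero_one_error(1, p))

# Marginal probabilities of exceeding the threshold, from Table 2.
baseline_male = best_error(0.6046)    # predict "above" for everyone
baseline_female = best_error(0.4140)  # predict "below" for everyone
```

These baseline errors are the unconditional misclassification rates that the 0-1 ratio compares against once education level is known.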

The same argument implies that the best predictor when 
conditioning on X is $\tilde{Y}(X) = \arg\max_y\{P(Y = y|X)\}$ 
and the error is 

$$E[1_{\tilde{Y}(X) \neq Y}|X] = \min\{P(Y = 0|X), P(Y = 1|X)\}.$$

To see that I has the desired properties, simply define 
a function g(0) = 0 and g(x) = 1 for any $x \neq 0$. 
Then g is a nonnegative penalty function and since 
$1_{\tilde{Y} \neq Y} = g(Y - \tilde{Y})$ we can use the same machinery 
from the prediction section, and our information link 
function therefore satisfies the assumptions of 
Definition 1. 

Magnetic Substorm Dependence Estimation 

We consider the problem of estimating the functional 
dependence measure given data $Y_1,\ldots,Y_n$ and 
$X_1,\ldots,X_n$. We assume that $X_i$ and $Y_i$ are functions 
taking values in $L^2[0,1]$, are centered and satisfy a 
functional linear model, that is, 

$$Y_i(t) = \int \beta(s,t)X_i(s)\,ds + \varepsilon_i(t),$$

and that $\beta$, as an operator, is bounded. We assume 
that both $\{X_i\}$ and $\{\varepsilon_i\}$ are i.i.d., are independent 
of each other, and 

$$E\|X_i\|^4 < \infty \quad\text{and}\quad E\|\varepsilon_i\|^4 < \infty.$$

For the estimation of $D_1$ and $D_2$, we refer to Ramsay 
and Silverman (2005). The third measure, $D_3$, 
we have not seen in previous literature so we provide 
a consistent estimator. Define $C_Y(s,t) = E[Y(s)Y(t)]$ 
and $C_X(s,t) = E[X(s)X(t)]$. Assume that the first 
d + 1 and p + 1 eigenvalues (ordered by magnitude) 
of $C_Y$ and $C_X$, respectively, are distinct. Define the 
projections $\tilde{Y}_1,\ldots,\tilde{Y}_n$ and $\tilde{X}_1,\ldots,\tilde{X}_n$, where 

$$\tilde{Y}_{i,j} = \langle Y_i,\hat{u}_j\rangle \quad\text{and}\quad \tilde{X}_{i,k} = \langle X_i,\hat{v}_k\rangle$$

for $j = 1,\ldots,d$ and $k = 1,\ldots,p$. Here $u_j$ is the jth 
eigenfunction of $C_Y(s,t)$ and $v_k$ is the kth eigenfunction 
of $C_X$ ($\hat{u}_j$ and $\hat{v}_k$ are the sample counterparts). 
Notice that $\tilde{Y}_{i,j}$ is uncorrelated across j since we are 
projecting onto the eigenfunctions of $C_Y$. Therefore, 
the $D_3$ measure can be expressed as 

$$D_3(X,Y) = 1 - d^{-1}\sum_{j=1}^{d}\frac{E(\langle Y,u_j\rangle - E[\langle Y,u_j\rangle|X])^2}{\lambda_j},$$

where $\lambda_j$ is the jth eigenvalue of $C_Y$. We can express 
$\tilde{Y}_{i,j}$ as 

$$\tilde{Y}_{i,j} = \sum_{k=1}^{p}\beta_{j,k}\tilde{X}_{i,k} + \varepsilon_{i,j} + \delta_{i,j},$$

where $\beta_{j,k} = \langle\beta,u_j\otimes v_k\rangle$, $\varepsilon_{i,j} = \langle\varepsilon_i,u_j\rangle$, and 
$\delta_{i,j} = \sum_{k=p+1}^{\infty}\langle\beta,u_j\otimes v_k\rangle\langle X_i,v_k\rangle$. If we define 

$$\hat{\beta} = (\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T\tilde{Y},$$

then one can show that $\hat{\beta} - \beta = o_P(1)$, $\hat{v}_k - v_k = o_P(1)$, 
and $\hat{u}_j - u_j = o_P(1)$; see Horváth, Kokoszka 
and Reimherr (2009) for details. Define the fitted 
values $\hat{Y}_i = \hat{\beta}\tilde{X}_i$. We then estimate $D_3$ as 



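$$\hat{D}_3(X,Y) = 1 - d^{-1}\sum_{j=1}^{d}\sum_{i=1}^{n}\frac{(\tilde{Y}_{i,j} - \hat{Y}_{i,j})^2}{n\hat{\lambda}_j}.$$

It is then easy to show (via Slutsky's lemma 
and the law of large numbers) that 

$$\hat{D}_3(X,Y) \stackrel{P}{\to} 1 - d^{-1}\sum_{j=1}^{d}\frac{1}{\lambda_j}E\left(\langle\varepsilon_i,u_j\rangle + \sum_{k=p+1}^{\infty}\langle\beta,u_j\otimes v_k\rangle\langle X_i,v_k\rangle\right)^2.$$

Notice that ideally we do not want the term 

$$\sum_{k=p+1}^{\infty}\langle\beta,u_j\otimes v_k\rangle\langle X_i,v_k\rangle$$

in the above expression, but that is the error we 
make in projecting to a finite dimension. But, if we 
choose p large, we can make the term arbitrarily 
small and, in practice, we expect that term to 
contribute relatively little to the overall estimate. 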

ACKNOWLEDGMENTS 

The authors would like to thank Xiao-Li Meng, 
Radu Craiu, Stefano Castruccio, Ryan King, three 
anonymous referees and the Associate Editor for 
comments that have greatly improved the manuscript. 

REFERENCES 

Ash, R. B. (1990). Information Theory. Dover, New York. 
MR1088248 

Bell, C. B. (1962). Mutual information and maximal corre- 
lation as measures of dependence. Ann. Math. Statist. 33 
587-595. MR0148182 






Boyd, S. and Vandenberghe, L. (2004). Convex Optimiza- 
tion. Cambridge Univ. Press, Cambridge. MR2061575 

Cover, T. M. and Thomas, J. A. (2006). Elements of Infor- 
mation Theory, 2nd ed. Wiley, Hoboken, NJ. MR2239987 

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). 
Maximum likelihood from incomplete data via the EM al- 
gorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 39 1-38. 
MR0501537 

Doksum, K. and Samarov, A. (1995). Nonparametric 
estimation of global functionals and a measure of the 
explanatory power of covariates in regression. Ann. Statist. 
23 1443-1473. MR1370291 

Ebrahimi, N., Soofi, E. S. and Soyer, R. (2010). Infor- 
mation measures in perspective. International Statistical 
Review 78 383-412. 

Efron, B. (1978). Regression and ANOVA with zero-one 
data: Measures of residual variation. J. Amer. Statist. As- 
soc. 73 113-121. MR0501624 

Gelfand, I. M. and Fomin, S. V. (1963). Calculus of Vari- 
ations. Prentice Hall International, Englewood Cliffs, NJ. 
MR0160139 

Goodman, L. A. and Kruskal, W. H. (1979). Measures 
of Association for Cross Classifications. Springer Series in 
Statistics 1. Springer, New York. MR0553108 

Gray, R. M. (2011). Entropy and Information Theory. 
Springer, New York. 

Hall, W. J. (1970). On characterizing dependence in 
joint distributions. In Essays in Probability and Statistics 
339-376. Univ. North Carolina Press, Chapel Hill, NC. 
MR0266353 

Horváth, L., Kokoszka, P. and Reimherr, M. (2009). 
Two sample inference in functional linear models. Canad. 
J. Statist. 37 571-591. MR2588950 

Lehmann, E. L. (1966). Some concepts of dependence. Ann. 
Math. Statist. 37 1137-1153. MR0202228 

Liang, K.-Y., Zeger, S. L. and Qaqish, B. (1992). 
Multivariate regression analyses for categorical data. J. R. Stat. 
Soc. Ser. B Stat. Methodol. 54 3-40. MR1157713 

Lipsitz, S. R., Laird, N. M. and Harrington, D. P. 
(1991). Generalized estimating equations for correlated bi- 
nary data: Using the odds ratio as a measure of association. 
Biometrika 78 153-160. MR1118240 



McCullagh, P. and Nelder, J. A. (1989). Generalized
Linear Models. Chapman & Hall, Boca Raton, FL.

Moskowitz, C. (2011). U.S. must take space storm threat 
seriously, experts warn. Available at
http://www.space.com/10906-space-storms-threat.html.

Nelsen, R. B. (2010). An Introduction to Copulas. Springer, 
New York. 

Nicolae, D. L. (2006). Quantifying the amount of missing
information in genetic association studies. Genet. Epidemiol.
30 703-717.

Nicolae, D. L., Meng, X.-L. and Kong, A. (2008).
Quantifying the fraction of missing information for hypothesis
testing in statistical and genetic studies. Statist. Sci. 23
287-312. MR2483902

Ramsay, J. O. and Silverman, B. W. (2005). Functional 
Data Analysis, 2nd ed. Springer, New York. MR2168993 

Reimherr, M. and Nicolae, D. L. (2011). You've gotta 
be lucky: Coverage and the elusive gene-gene interaction. 
Ann. Hum. Genet. 75 105-111. 

Rényi, A. (1959). On measures of dependence. Acta
Math. Acad. Sci. Hungar. 10 441-451 (unbound insert). 
MR0115203 

Reshef, D. N., Reshef, Y. A., Finucane, H. K.,
Grossman, S. R., McVean, G., Turnbaugh, P. J.,
Lander, E. S., Mitzenmacher, M. and Sabeti, P. C. (2011).
Detecting novel associations in large data sets. Science 334
1518-1524.

Schweizer, B. and Wolff, E. F. (1981). On nonparametric
measures of dependence for random variables. Ann. Statist. 
9 879-885. MR0619291 

Siburg, K. F. and Stoimenov, P. A. (2010). A mea- 
sure of mutual complete dependence. Metrika 71 239-251. 
MR2602190 

Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007).
Measuring and testing dependence by correlation of dis- 
tances. Ann. Statist. 35 2769-2794. MR2382665 

Székely, G. J. and Rizzo, M. L. (2009). Brownian distance
covariance. Ann. Appl. Stat. 3 1236-1265. MR2752127 

U.S. Census Bureau (2010). Educational attainment — people 
25 years old and over. Available at
http://www.census.gov/hhes/www/cpstables/032010/perinc/new03_001.htm.



